High-throughput sequencing and genomes

Jelmer Poelstra

CFAES Bioinformatics Core, Ohio State University

2026-01-29

Introduction to sequencing technologies

What do we mean by sequencing?

What does the term “sequencing”, as in “high-throughput sequencing”, generally refer to?

Determining the sequence of DNA, RNA, or protein fragments.

Most commonly, especially in “high-throughput” sequencing, it refers to DNA sequencing specifically.

This week and next, we will focus on DNA sequencing only, keeping in mind that:

  • Protein sequencing involves completely different technologies
  • RNA can be, and usually is, sequenced via DNA sequencing

    How is that done and why?  

    RNA is usually reverse transcribed to DNA (cDNA) prior to sequencing.

    While it is becoming more feasible to directly sequence RNA molecules, RNA is an unstable molecule that is easily degraded and harder to sequence.

Overview of sequencing technologies

  • Sanger sequencing (developed in 1977; automated since the mid-1980s)
    Sequences a single, typically PCR-amplified, short-ish (≤900 bp) DNA fragment at a time

  • High-throughput sequencing (HTS)
    Sequences 10^5 to 10^9 DNA fragments at a time, usually randomly selected. Two types:

    • Short-read HTS: More accurate, shorter reads (since 2005)

    • Long-read HTS: Less accurate, longer reads (since 2011)


These sequenced fragments of DNA are usually called “reads”.

Sanger sequencing

  • Sanger sequencing almost always starts with PCR amplification of the target DNA region —
    as illustrated by Dr. Popp last week:
  • Therefore, you must know something about the target sequence in advance to design primers

  • Sequencing itself is performed by synthesizing a new DNA strand with fluorescently-labeled nucleotides, using a different color for each base (A, C, G, T)

  • The final result is a chromatogram that can be “base-called”:

https://dnacore.mgh.harvard.edu/new-cgi-bin/site/pages/sequencing_pages/seq_troubleshooting.jsp


The entire human genome was sequenced with Sanger technology!

How many basepairs is that? Want to guess how much this cost?

Sequencing cost through time

https://www.genome.gov/about-genomics/fact-sheets/Sequencing-Human-Genome-cost

Present-day Sanger applications

  • Because HTS has much higher throughput and is much cheaper per base, Sanger sequencing has become less widely used

  • But it is not obsolete, in part because high throughput isn’t always needed or even wanted

An AI-generated image showing a giant peanut butter jar.

Image generated by Adobe Firefly

Some present-day uses of Sanger sequencing include:

  • Taxonomic identification of samples

  • Examining variation among individuals or populations in one or a few candidate or marker genes

High-throughput sequencing (HTS)

Omics

Let’s start with the big picture – HTS data underlies several of these main “omics” approaches:

A diagram showing the main omics data types.

Copyright ThermoFisher

The main omics data types

Omics type        Molecule type
Genomics          DNA
Epigenomics       DNA modifications
Transcriptomics   RNA
Proteomics        Proteins
Metabolomics      Metabolites


What does the -omics suffix mean?  

The “omics” suffix indicates the involvement of large-scale datasets — in the sense that, for example, “genomics” data typically spans much or all of the genome.

While the boundaries can be fuzzy, sequencing a single gene in a single organism is not genomics, and running qPCR for a handful of genes is not transcriptomics.

The main omics data types

Omics type        Molecule type       Data mainly produced by
Genomics          DNA                 High-throughput sequencing (HTS)
Epigenomics       DNA modifications   High-throughput sequencing (HTS)
Transcriptomics   RNA                 High-throughput sequencing (HTS)
Proteomics        Proteins            Mass spectrometry
Metabolomics      Metabolites         Mass spectrometry

Examples of HTS applications

  • Whole-genome assembly — for producing reference genomes
  • Typing of SNPs and other sequence variants — for population genetics, GWAS, etc.
  • 16S Metabarcoding – for microbial community characterization
  • RNA-Seq — for large-scale gene expression analysis


Do any of you work on, or plan to work on, projects like these?

Examples of HTS applications

  • In each of these applications:
    • What part(s) of the genome are you sequencing?
    • Who are you sequencing (how many individuals/species)?
    • How are the sequences used?

  • Whole-genome assembly
    • The whole genome! 😃
    • A single individual
    • Reads are “overlapped” into larger fragments (chromosomes if possible)

  • Typing of SNPs and other sequence variants
    • Varies! Can be the whole genome (“resequencing”), but also specific (e.g. exome) or random (GBS/RAD) regions
    • Many individuals, which are compared
    • For each site & individual, the variant state (allele) is determined
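
To make "determining the variant state" a bit more concrete, here is a deliberately naive sketch of calling alleles at a single site from the bases seen in the reads that cover it. The function name `call_genotype`, the `min_fraction` threshold, and the toy data are all made up for illustration; real variant callers use statistical models that account for base qualities and sequencing errors.

```python
from collections import Counter


def call_genotype(bases, min_fraction=0.2):
    """Return the alleles supported by at least `min_fraction` of reads at one site."""
    counts = Counter(bases)
    total = sum(counts.values())
    alleles = [base for base, n in counts.items() if n / total >= min_fraction]
    return sorted(alleles)


# Hypothetical bases observed at one site, one string per individual
site_reads = {
    "ind1": "AAAAAAAAAA",  # every read shows A
    "ind2": "AAAAGGGGGA",  # reads split between A and G
    "ind3": "AAAAAAAAAT",  # a single T, below the threshold
}

genotypes = {name: call_genotype(bases) for name, bases in site_reads.items()}
for name, alleles in genotypes.items():
    print(name, alleles)
```

With this threshold, the lone T in the third individual is treated as a likely sequencing error rather than as a second allele.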

  • 16S Metabarcoding
    • A specific locus (e.g., 16S rRNA gene for bacteria) – but across species!
    • Many samples, which can be soil, water, gut contents, etc.
    • Sequences are assigned a taxonomic identity and then counted
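
The "assign, then count" idea can be sketched in a few lines. The reference table, taxon names, and reads below are hypothetical toy data, and the exact-match lookup is a simplification: actual metabarcoding workflows use dedicated classifiers that tolerate sequence differences.

```python
from collections import Counter

# Hypothetical reference: marker sequences with a known taxonomic identity
reference = {
    "ACGTACGT": "Genus_A",
    "TTGACCAA": "Genus_B",
}


def count_taxa(reads, reference):
    """Assign each read a taxon by exact lookup and count reads per taxon."""
    return Counter(reference.get(read, "unassigned") for read in reads)


# Hypothetical reads from one sample
sample_reads = ["ACGTACGT", "TTGACCAA", "ACGTACGT", "GGGGCCCC"]
counts = count_taxa(sample_reads, reference)
print(counts)
```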

  • RNA-Seq
    • The entire transcriptome

The main HTS technologies

                 Short-read HTS                         Long-read HTS
Usage            More                                   Less (but increasing)
Main companies   Illumina                               Oxford Nanopore Technologies (ONT) & Pacific Biosciences (PacBio)
Timeline         Since 2005; technology fairly stable   Since 2011; still rapid development
Read lengths     50-300 bp                              10-100+ kbp
Error rates      Mostly <0.1%                           1-10% (ONT) / <0.1-10% (PacBio)
Throughput       Higher                                 Lower
Cost per base    Lower                                  Higher

Read lengths

Can you think of applications where long reads are useful?  

For example:

  • Genome assembly
  • Read-based taxonomic identification

Can you think of applications where read length may not matter much?  

For example:

  • (SNP) variant analysis
  • Counting applications such as RNA-Seq

Here, genomic locations (variant analysis) or gene identities (RNA-Seq) can be reliably inferred from as little as 25 bp.
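
A back-of-the-envelope check shows why such short reads can suffice: the number of possible 25 bp sequences (roughly 10^15) is far larger than the number of positions in a genome. The genome size below is an assumed, roughly human-sized value, and the calculation ignores repeats, which are the main reason short reads sometimes cannot be placed uniquely.

```python
read_length = 25
genome_size = 3_000_000_000  # assumption: a roughly human-sized genome, in bp

# Each position in a read can be one of four bases
possible_sequences = 4 ** read_length
ratio = possible_sequences / genome_size

print(f"Possible {read_length} bp sequences: {possible_sequences:.2e}")
print(f"Possible sequences per genome position: {ratio:,.0f}")
```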

Error rates

A read’s sequence may differ from the actual DNA sequence it came from:

  • The read can have base-calling errors, missing bases, or extra bases
  • When the base calling software is not confident, it can also return Ns (= undetermined)

A chromatogram with several uncalled bases.

When you receive HTS reads, base calls have typically been made already.
Every base call is accompanied by a quality score, representing the estimated error probability.
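
As an illustration of how a quality score relates to an error probability, the sketch below assumes the widely used Phred scale (Q = -10 × log10(P)) and, for the decoding step, the common convention of storing each score as a single text character with an offset of 33. Both are assumptions about your data, and the function names are made up for this example.

```python
import math


def phred_to_error_prob(q):
    """Convert a Phred quality score to the estimated error probability."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Convert an error probability to a Phred quality score."""
    return -10 * math.log10(p)


def decode_quality_string(quality, offset=33):
    """Decode a string of quality characters into integer scores (assumes offset 33)."""
    return [ord(char) - offset for char in quality]


# Hypothetical quality string for a 4-base read
scores = decode_quality_string("II5#")
for q in scores:
    print(f"Q{q}: error probability {phred_to_error_prob(q):.4f}")
```

On this scale, Q20 corresponds to a 1 in 100 chance that the base call is wrong, and Q30 to 1 in 1,000.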

Correcting sequencing errors

To overcome sequencing errors, every base can be sequenced multiple times –
i.e., obtaining a “depth of coverage” greater than 1:

A diagram illustrating the concept of depth of coverage.

Which natural phenomenon might complicate this effort?

Genetic variation among and (for diploid organisms) within individuals.

Typical depths of coverage are ~50-100x for genome assembly and 10-30x for “resequencing” (!)
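
Average depth of coverage can be estimated as the total number of sequenced bases divided by the genome size. The sketch below applies that to a made-up example; the genome size, read length, and function names are assumptions for illustration, and actual depth varies along the genome.

```python
import math


def mean_depth(n_reads, read_length, genome_size):
    """Average depth of coverage: total sequenced bases divided by genome size."""
    return n_reads * read_length / genome_size


def reads_needed(target_depth, read_length, genome_size):
    """Number of reads needed to reach a target average depth."""
    return math.ceil(target_depth * genome_size / read_length)


# Hypothetical project: 500 Mbp genome, 150 bp reads, 30x target depth
genome_size = 500_000_000
n_reads = reads_needed(30, 150, genome_size)
print(f"Reads needed: {n_reads:,}")
print(f"Resulting mean depth: {mean_depth(n_reads, 150, genome_size):.1f}x")
```

For paired-end data, each fragment yields two reads, so the number of fragments needed is half the number of reads.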

Short-read HTS

Libraries and library prep

  • In a HTS context, a “library” is a collection of DNA fragments ready for sequencing.

  • These fragments can number in the millions or billions and are often randomly generated from input like genomic DNA:

A diagram showing the main Illumina library preparation steps.

An overview of the library prep procedure. This is typically done for you by a sequencing facility or company.

Libraries and library prep

  • After library prep, each DNA fragment is flanked by several types of short sequences that together make up the “adapters”:



Multiplexing!

Adapters can include “indices” or “barcodes” to identify individual samples, so many samples can be combined (multiplexed) into a single library.
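
After sequencing, reads are sorted back to their samples based on the barcode ("demultiplexing"). This is usually done for you by the sequencing facility, but the idea can be sketched as follows. The barcodes, sample names, and the `demultiplex` function are hypothetical, and real tools typically also tolerate a mismatch in the barcode.

```python
from collections import defaultdict


def demultiplex(reads, barcode_to_sample):
    """Group (barcode, sequence) pairs by the sample that each barcode belongs to."""
    by_sample = defaultdict(list)
    for barcode, sequence in reads:
        sample = barcode_to_sample.get(barcode, "undetermined")
        by_sample[sample].append(sequence)
    return dict(by_sample)


# Hypothetical barcodes and reads
barcode_to_sample = {"ACGT": "sample1", "TGCA": "sample2"}
reads = [
    ("ACGT", "GGATTCAG"),
    ("TGCA", "CCTTAGGA"),
    ("ACGT", "TTAGGCAT"),
    ("NNNN", "GATTACAG"),  # unreadable barcode
]

by_sample = demultiplex(reads, barcode_to_sample)
for sample, sequences in by_sample.items():
    print(sample, len(sequences))
```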

Paired-end vs. single-end sequencing

  • DNA fragments can be sequenced from both ends as shown below —
    this is called “paired-end” (PE) sequencing:

A diagram showing forward and reverse reads in paired-end sequencing.


  • When sequencing is instead single-end (SE), no reverse read is produced:

Fragment size variation

  • DNA fragment size varies – by design and because of limited precision in size selection

A diagram showing forward and reverse reads in paired-end sequencing.

What happens when the fragment is shorter than the length of a single F or R read?

“Adapter read-through”: the final bases in the resulting reads will consist of adapter sequence, which should be removed before downstream analysis.

A diagram illustrating the scenario when the DNA fragment is shorter than the single read length
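
A minimal sketch of adapter removal is shown below. It only handles exact matches, whereas dedicated trimming tools also allow for sequencing errors. The adapter sequence is the commonly cited start of an Illumina adapter, but treat it, the function name `trim_adapter`, and the `min_match` parameter as assumptions for illustration.

```python
def trim_adapter(read, adapter, min_match=5):
    """Remove adapter sequence from the 3' end of a read (exact matches only)."""
    # Case 1: the full adapter occurs somewhere in the read
    pos = read.find(adapter)
    if pos != -1:
        return read[:pos]
    # Case 2: the read ends partway through the adapter
    for size in range(min(len(adapter), len(read)) - 1, min_match - 1, -1):
        if read.endswith(adapter[:size]):
            return read[:-size]
    # No adapter found: return the read unchanged
    return read


adapter = "AGATCGGAAGAGC"  # assumed adapter sequence
fragment = "ACGTTGCA"      # hypothetical short DNA fragment

full = trim_adapter(fragment + adapter + "TT", adapter)
partial = trim_adapter(fragment + adapter[:8], adapter)
untouched = trim_adapter("ACGTTGCATTTT", adapter)
print(full, partial, untouched)
```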

What happens when the fragment is shorter than the combined F + R read length?

Overlapping reads (this can be useful!):

A diagram illustrating the scenario when the DNA fragment is shorter than the combined read length
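
One reason overlap can be useful is that the forward and reverse read can be merged into a single, longer sequence. The sketch below does this for error-free toy reads by reverse-complementing the reverse read and looking for the longest exact overlap. The function names, `min_overlap`, and the fragment are made up for illustration; real read mergers also use base qualities and tolerate mismatches.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(str.maketrans("ACGTN", "TGCAN"))[::-1]


def merge_pair(forward, reverse, min_overlap=5):
    """Merge a read pair on the longest exact overlap, or return None if none is found."""
    reverse_rc = reverse_complement(reverse)
    max_len = min(len(forward), len(reverse_rc))
    for size in range(max_len, min_overlap - 1, -1):
        if forward[-size:] == reverse_rc[:size]:
            return forward + reverse_rc[size:]
    return None


# Hypothetical 18 bp fragment sequenced with 12 bp paired-end reads
fragment = "ACGTTGCAGGCTAAGTCC"
forward = fragment[:12]
reverse = reverse_complement(fragment[-12:])

merged = merge_pair(forward, reverse)
print(merged)
```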

How Illumina sequencing works

Sequencing is performed by synthesizing a new strand using fluorescently-labeled bases and taking a picture each time a new nucleotide is incorporated:

The scale of Illumina sequencing

Video of Illumina technology

Reference genomes

Many HTS applications either require a “reference genome” or involve its production.
What exactly does reference genome refer to? It usually includes:

  • An assembly
    A representation of most or all of the genome’s DNA sequence: the genome assembly

  • An annotation
    Provides e.g. locations of genes and other genomic “features” in the corresponding genome assembly, and functional information for these features


Taxonomic identity

Reference genomes are typically applicable at the species level. For example, if you work with maize, you want a Zea mays reference genome. But:

  • If needed, it’s often possible to work with genomes of closely related species
  • Conversely, different subspecies/lines may have their own reference genomes

There is enormous variation in genome size

https://en.wikipedia.org/wiki/Genome_size

And enormous growth of genomes in databases


Konkel and Slot (2023)

Genome assemblies

  • With increasing usage & quality of long-read HTS, assemblies are getting better and better

  • Chromosome-level assemblies require additional technologies (e.g., Hi-C)

  • Many assemblies instead consist of fragments, often thousands of them (contigs and scaffolds)


How is this data stored?

Both genome assemblies and annotations are typically saved in a single text file each — we’ll explore some of these files in tomorrow’s lab.

Recap

You’ve learned:

  1. That high-throughput sequencing (HTS) enables large-scale DNA sequencing

  2. How short-read and long-read HTS have different strengths and weaknesses

  3. About libraries and the technology underlying short-read sequencing

  4. That reference genomes are essential for many HTS applications

Looking forward

The Garrigós et al. 2025 dataset

The labs this and next week are organized around the data set from Garrigós et al. (2025):

A screenshot of the paper's front matter.

This paper uses paired-end Illumina RNA-Seq data to study gene expression in Culex pipiens mosquitos infected with two different malaria-causing Plasmodium protozoans.

Tomorrow’s lab

  • You will first learn about what kind of computing environment is commonly used for analyzing high-throughput sequencing (HTS) data
  • Then, you’ll work in this environment to explore an HTS dataset, checking out HTS read and reference genome files, and performing read quality control

Next week’s content

  • In the lecture, you will learn about RNA-Seq, the state-of-the-art transcriptomics method
  • In the lab, you will perform differential gene expression analysis on the Culex pipiens RNA-Seq dataset

References

Garrigós, Marta, Guillem Ylla, Josué Martínez-de la Puente, Jordi Figuerola, and María José Ruiz-López. 2025. “Two Avian Plasmodium Species Trigger Different Transcriptional Responses on Their Vector Culex pipiens.” Molecular Ecology 34 (15): e17240. https://doi.org/10.1111/mec.17240.
Konkel, Zachary, and Jason C. Slot. 2023. “Mycotools: An Automated and Scalable Platform for Comparative Genomics.” BioRxiv. https://doi.org/10.1101/2023.09.08.556886.